2024/01/10

Part I: EDA with WDI

Exploratory Data Analysis

  • Statistics vs Data Science

  • Hypothesis Testing vs Hypothesis Creating

  • Problem Solving vs Understanding Issues (to Attack Problems)

A short introduction to EDA workflow

  1. Start RStudio

  2. Create or open a project

    • File > Open Project or File > Recent Projects (or File > New Project)

  3. Work in an R Notebook, R Markdown, or Quarto document (or an R Script or the Console)

    • For reproducibility, literate programming, and communication

    • R Notebook: HTML file with R Markdown source file

Examples

  • Purchasing power parities (PPPs) [Link], [Rmd]

  • Life Expectancy [Link], [Rmd]

Review - Week 2

World Development Indicators

  • Data themes [Link]

  • People [Link]

    • Life-expectancy

      • R Notebook [Link to the source], [R Notebook]

      • World Development Indicators: life-expectancy

        • Population, total: SP.POP.TOTL [Link]

        • Fertility rate, total (births per woman): SP.DYN.TFRT.IN [Link]

        • Life expectancy at birth, total: SP.DYN.LE00.IN [Link]

Preview

  • Week 3, Jan. 10: EDA2, WDI, data transformation, data visualization

  • Week 4, Jan. 17: EDA3, WDI, tidy data (long and wide data), choropleth maps

  • Week 5, Jan. 24: EDA4, UN, OECD, readr, readxl, two-table verbs

  • Week 6, Jan. 31: EDA5, Round-up, communication, pdf, Word, PowerPoint

WDI Package

CRAN Package Site: WDI [https://CRAN.R-project.org/package=WDI]

WDI: World Development Indicators and Other World Bank Data

The WDI function provides convenient access to over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national poverty indicators. For fast searching, the WDI package ships with a local list of available data series. This local list can be updated to the latest version using the WDIcache function.

Search: CRAN package=WDI.

See the URL, News, and Reference Manual links on the package page.

For other packages, use the same search pattern: 'CRAN package=PackageName'.

Data Importing

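library(WDI)  # provides WDI(), WDIcache(), WDIsearch()

# Named vector: the names (pop, fertility, lifeexp) become the column names
df_lifeexp <- WDI(indicator = c(pop = "SP.POP.TOTL",
                                fertility = "SP.DYN.TFRT.IN",
                                lifeexp = "SP.DYN.LE00.IN"),
                  extra = TRUE)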

Arguments that are not specified take their default values; see the help page (?WDI). With the defaults written out, the call becomes:

df_lifeexp <- WDI(country = "all",
                  indicator = c(pop = "SP.POP.TOTL",
                                fertility = "SP.DYN.TFRT.IN",
                                lifeexp = "SP.DYN.LE00.IN"),
                  start = 1960, end = NULL,
                  extra = TRUE, cache = NULL, latest = NULL,
                  language = "en")

Data Transforming

  • dplyr, a package in tidyverse

    • select() : select columns, e.g. select(c(1,2,4,5,7,8))

    • filter() : select rows meeting a condition

      • filter(country == "World")

      • filter(iso2c %in% c("JP", "IN", "CN"))

      • filter(region != "Aggregates")

    • distinct() : keep rows with distinct values of a variable, e.g. distinct(country)

    • drop_na() : drop rows with NA values, e.g. drop_na(pop)
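
The verbs above can be tried without downloading anything. The following is a minimal sketch on a small hand-made table: the data frame df_toy and all of its values are invented for illustration and are not WDI figures; only the column names mimic the WDI output.

```r
library(dplyr)   # select(), filter(), distinct(), tibble()
library(tidyr)   # drop_na()

# Hypothetical toy data: the values are made up, not real WDI figures
df_toy <- tibble(
  country = c("World", "Japan", "Japan", "India", "China"),
  iso2c   = c("1W", "JP", "JP", "IN", "CN"),
  region  = c("Aggregates", "East Asia & Pacific", "East Asia & Pacific",
              "South Asia", "East Asia & Pacific"),
  year    = c(2020, 2019, 2020, 2020, 2020),
  pop     = c(100, 10, NA, 30, 40)
)

# select(): keep columns by name (or by position, e.g. select(c(1, 4, 5)))
df_small <- df_toy |> select(country, year, pop)

# filter(): keep rows meeting a condition
df_countries <- df_toy |> filter(region != "Aggregates")
df_three     <- df_toy |> filter(iso2c %in% c("JP", "IN", "CN"))

# distinct(): one row per distinct value of a variable
df_names <- df_toy |> distinct(country)

# drop_na(): drop rows where pop is NA
df_complete <- df_toy |> drop_na(pop)
```

Selecting by name is usually easier to read than selecting by position, and it keeps working if the column order changes.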

Data Visualizing

  • ggplot2

    • ggplot(aes(x=var1, y=var2)) +

    • Scatter Plot: geom_point(aes(col=var3, size=var4, shape=var5))

    • Line Graph: geom_line(aes(col=var3, linetype=var4))

    • Title, Subtitle, Legend and Axis labels

      • labs(title = "", subtitle ="", x = "", y = "", col = "")
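
The pieces listed above combine into a complete plot as follows. This is a minimal sketch on invented data: df_plot, its columns, and its values are hypothetical and stand in for var1 to var4.

```r
library(ggplot2)

# Hypothetical toy data: two made-up series over three years
df_plot <- data.frame(
  year    = rep(2018:2020, times = 2),
  country = rep(c("A", "B"), each = 3),
  value   = c(1.0, 1.5, 2.0, 3.0, 2.5, 2.0)
)

# Line graph: one line per country, distinguished by colour
p_line <- ggplot(df_plot, aes(x = year, y = value)) +
  geom_line(aes(col = country)) +
  labs(title = "Toy line graph", subtitle = "Made-up values",
       x = "Year", y = "Value", col = "Country")

# Scatter plot: colour and point size mapped to variables
p_point <- ggplot(df_plot, aes(x = year, y = value)) +
  geom_point(aes(col = country, size = value)) +
  labs(title = "Toy scatter plot",
       x = "Year", y = "Value", col = "Country", size = "Value")

p_line
p_point
```

Aesthetics set in ggplot(aes(...)) apply to every layer, while those set inside geom_line() or geom_point() apply to that layer only.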

Examples [Link]

Line graph

df_life |> filter(country == "World") |> 
  drop_na(life_expectancy) |>
  ggplot(aes(year, life_expectancy)) + geom_line() +
  labs(title = "Life expectancy of the World")

Scatter Plot

df_life |> filter(year == 2021) |>
  filter(region != "Aggregates") |> 
  drop_na(fertility_rate, life_expectancy) |>
  ggplot(aes(fertility_rate, life_expectancy, col = region)) + 
  geom_point()

World Development Indicators

  • Data themes [Link]

  • Data Catalog: World Development Indicators [Link]

  • The World by Income and Region [Link]

  • World Bank Country and Lending Groups [Link]

  • WDI Package

    • WDI(), WDIcache(), WDIsearch()

WDIcache() [Link in da4r]

See Help.

wdicache <- WDIcache()
str(wdicache)
## List of 2
##  $ series :'data.frame': 24460 obs. of  5 variables:
##   ..$ indicator         : chr [1:24460] "1.0.HCount.1.90usd" "1.0.HCount.2.5usd" "1.0.HCount.Mid10to50" "1.0.HCount.Ofcl" ...
##   ..$ name              : chr [1:24460] "Poverty Headcount ($1.90 a day)" "Poverty Headcount ($2.50 a day)" "Middle Class ($10-50 a day) Headcount" "Official Moderate Poverty Rate-National" ...
##   ..$ description       : chr [1:24460] "The poverty headcount index measures the proportion of the population with daily per capita income (in 2011 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income below the of"| __truncated__ ...
##   ..$ sourceDatabase    : chr [1:24460] "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" ...
##   ..$ sourceOrganization: chr [1:24460] "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of data from National Statistical Offices." ...
##  $ country:'data.frame': 297 obs. of  9 variables:
##   ..$ iso3c    : chr [1:297] "ABW" "AFE" "AFG" "AFR" ...
##   ..$ iso2c    : chr [1:297] "AW" "ZH" "AF" "A9" ...
##   ..$ country  : chr [1:297] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
##   ..$ region   : chr [1:297] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
##   ..$ capital  : chr [1:297] "Oranjestad" "" "Kabul" "" ...
##   ..$ longitude: chr [1:297] "-70.0167" "" "69.1761" "" ...
##   ..$ latitude : chr [1:297] "12.5167" "" "34.5228" "" ...
##   ..$ income   : chr [1:297] "High income" "Aggregates" "Low income" "Aggregates" ...
##   ..$ lending  : chr [1:297] "Not classified" "Aggregates" "IDA" "Aggregates" ...

str(wdicache, max.level=1)
## List of 2
##  $ series :'data.frame': 24460 obs. of  5 variables:
##  $ country:'data.frame': 297 obs. of  9 variables:

wdicache$series

wdicache$country
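
Both elements are ordinary tables, so the dplyr verbs apply to them directly. The sketch below uses WDI_data, the local list shipped with the package, so that it runs without a download; it assumes this list has the same two elements (series and country) as the WDIcache() result, and the object names series, country, picked, and region_counts are chosen here for illustration.

```r
library(WDI)
library(dplyr)

# Local list shipped with the package; as.data.frame() leaves data frames
# unchanged and guards against versions that store these as matrices
series  <- as.data.frame(WDI::WDI_data$series)
country <- as.data.frame(WDI::WDI_data$country)

# Look up the names of the three indicators used in this lecture
codes  <- c("SP.POP.TOTL", "SP.DYN.TFRT.IN", "SP.DYN.LE00.IN")
picked <- series |>
  filter(indicator %in% codes) |>
  select(indicator, name)
picked

# Count entries per region; "Aggregates" marks non-country groupings
region_counts <- country |> count(region, sort = TRUE)
region_counts
```

The shipped list may be older than what WDIcache() returns, so the counts can differ from the output shown above.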

Write and/or Read Lists

write_rds(wdicache, "data/wdicache.rds")
wdicache <- read_rds("data/wdicache.rds")
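
An RDS file stores a single R object, including a list of data frames, and restores it unchanged. The following is a minimal sketch of the round trip on a small invented list; toy_list and the folder name are hypothetical, and a temporary directory is used so that nothing is written into the project. Note that the target folder (data/ in the lines above) must exist before writing.

```r
library(readr)

# Hypothetical stand-in for wdicache: a list of two small data frames
toy_list <- list(
  series  = data.frame(indicator = c("X.1", "X.2"),
                       name = c("First series", "Second series")),
  country = data.frame(iso2c = c("AA", "BB"))
)

# Create the target folder first; write_rds() does not create it
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)

path <- file.path(out_dir, "toy_list.rds")
write_rds(toy_list, path)
toy_back <- read_rds(path)

# The object read back is the same as the one written
identical(toy_list, toy_back)
```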

WDIsearch() [Link in da4r]

WDIsearch(string = "education", field = "name", short = TRUE, cache = wdicache)
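
When cache is left at its default, WDIsearch() searches the list shipped with the package, so no download is needed. The sketch below shows two searches under that assumption; the search strings and the result names res_name and res_code are chosen here for illustration.

```r
library(WDI)

# Search the indicator names in the shipped list (case-insensitive)
res_name <- WDIsearch(string = "life expectancy at birth", field = "name")

# string is treated as a regular expression, and field selects the column
# to search: here, indicator codes beginning with "SP.POP"
res_code <- WDIsearch(string = "^SP\\.POP", field = "indicator")

head(res_name)
head(res_code)
```

Searching by code prefix is a quick way to list related indicators once one code in a family is known.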

WDIbulk()

# Downloads the full WDI bulk file; this can take several minutes
wdibulk <- WDIbulk(timeout = 600)
str(wdibulk, max.level = 1)
## List of 6
##  $ Data          : tibble [24,902,388 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Country       :'data.frame': 265 obs. of  31 variables:
##  $ Series        :'data.frame': 1486 obs. of  21 variables:
##  $ Country-Series:'data.frame': 8241 obs. of  4 variables:
##  $ Series-Time   :'data.frame': 148 obs. of  4 variables:
##  $ FootNote      :'data.frame': 741689 obs. of  5 variables:

group_by, summarize [Link in da4r]

df_iris <- datasets::iris
df_iris |> group_by(Species) |>
  summarize(mean_sl = mean(Sepal.Length), mean_sw = mean(Sepal.Width))

mutate [Link in da4r]

df_iris |> mutate(Sepal.Ratio = Sepal.Length/Sepal.Width)
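
mutate() and group_by() with summarize() are often chained: first add a column, then summarize it by group. A minimal sketch on the same built-in iris data; the names df_ratio, n, and mean_ratio are chosen here for illustration.

```r
library(dplyr)

df_iris <- datasets::iris

df_ratio <- df_iris |>
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) |>   # add a new column
  group_by(Species) |>                                  # one group per species
  summarize(n = n(),                                    # rows in each group
            mean_ratio = mean(Sepal.Ratio))             # group mean of the new column

df_ratio
```

The result has one row per species, with the group size and the mean ratio as columns.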

Part II: Practicum

File Links

Week Three Assignment

Choose two WDI codes and analyse the data.

  • Create an R Notebook of a data analysis containing the following, and submit the rendered file (e.g. w3_g123456.nb.html):

    1. create an R Notebook using the R Notebook Template in Moodle, and save it as w3_g123456.Rmd,

    2. edit the title and the author field (your name and ID)

  • Contents should include the following:

    • A short abstract

    • Description of data including data name, data code (or id), description

    • A bar graph or a column graph, a histogram, a line graph, a scatter plot

    • Observations or questions for visualizations

  • run each code block, preview to create w3_g123456.nb.html,

  • submit w3_g123456.nb.html to Moodle.

Submit your R Notebook file (w3_g123456.nb.html) in Moodle (Week Three Assignment).

Due Sunday, 14 January 2024, 11:59 PM
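
The line graph and scatter plot follow the examples in Part I. For the column graph and histogram, the following is a minimal sketch using the built-in iris data rather than WDI data; the object names p_col and p_hist and the choice of 20 bins are assumptions made for illustration, to be replaced by your own indicators and settings.

```r
library(dplyr)
library(ggplot2)

df_iris <- datasets::iris

# Column graph: one bar per group, with the height given by a computed value
p_col <- df_iris |>
  group_by(Species) |>
  summarize(mean_sl = mean(Sepal.Length)) |>
  ggplot(aes(x = Species, y = mean_sl)) +
  geom_col() +
  labs(title = "Mean sepal length by species",
       x = "Species", y = "Mean sepal length")

# Histogram: distribution of a single numeric variable
p_hist <- df_iris |>
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(bins = 20) +
  labs(title = "Distribution of sepal length",
       x = "Sepal length", y = "Count")

p_col
p_hist
```

geom_col() plots values that are already computed, whereas geom_bar() counts rows; for WDI data, where each row already holds a value, geom_col() is usually the one needed.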

Recommended Readings and Activities